“Use the Tidyverse, Luke” – O-W.Kenobi
library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.2.0     v purrr   0.3.2
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   0.8.3     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
package ‘dplyr’ was built under R version 3.6.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(skimr)
Attaching package: ‘skimr’
The following object is masked from ‘package:stats’:
filter
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
The Crossref data used here come from the Setup page of the Library Carpentry (LC) OpenRefine workshop.
crossref_data <- read_csv("https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv",
col_types = cols(Date = col_date(format = "%m/%d/%Y")))
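The col_date() format string tells readr how the Date column is written in the CSV. As a minimal sketch, the same format can be applied to a single date string with readr::parse_date(). The string below is a made-up example in month/day/year order and is not read from the workshop file.

```r
library(readr)

# Hypothetical date string written month/day/year, like the CSV's Date column
example_date <- parse_date("01/11/2015", format = "%m/%d/%Y")

# The result is a Date object, which prints in ISO order (year-month-day)
example_date
class(example_date)
```

If the format string does not match the text, readr returns NA with a parsing warning, so it is worth checking one or two values this way before importing the whole file.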
Take a quick look at the data
glimpse(crossref_data)
Observations: 1,001
Variables: 11
$ Title     <chr> "The Fisher Thermodynamics of Quasi-Probabilities", "Aflatoxin Contamination of the...
$ Authors   <chr> "Flavia Pennini|Angelo Plastino", "Naveed Aslam|Peter C. Wynn", "Rafael R. C. Cuadr...
$ DOI       <chr> "10.3390/e17127853", "10.3390/agriculture5041172", "10.3390/ijms161226101", "10.339...
$ URL       <chr> "https://doaj.org/article/b75e8d5cca3f46cbbd63e91be5b32412", "https://doaj.org/arti...
$ Date      <date> 2015-01-11, 2015-01-11, 2015-01-11, 2015-01-11, 2015-01-11, 2015-01-11, 2015-01-11...
$ Language  <chr> "English", "English", "English", "EN", "EN", "English", "English", "English", "Engl...
$ Subjects  <chr> "Fisher information|quasi-probabilities|complementarity|Physics|QC1-999|Science|Q",...
$ ISSNs     <chr> "1099-4300", "2077-0472", "1422-0067", "2304-6740", "2306-5338", "1420-3049", "2073...
$ Publisher <chr> "MDPI AG", "MDPI AG", "MDPI AG", "MDPI AG", "MDPI AG", "MDPI AG", "MDPI AG", "MDPI ...
$ Citation  <chr> "Entropy, Vol 17, Iss 12, Pp 7848-7858 (2015)", "Agriculture (Basel), Vol 5, Iss 4,...
$ Licence   <chr> "CC BY", "CC BY", "CC BY", "CC BY", "CC BY", "CC BY", "CC BY", "CC BY", "CC BY", "C...
crossref_data
skimr is an easy way to take a quick look at the variables in a data frame. In this case the data are mostly character strings. With numeric data, skimr also produces a thumbnail histogram (sparkline).
skim(crossref_data)
Skim summary statistics
n obs: 1001
n variables: 11
-- Variable type:character -----------------------------------------------------
variable missing complete n min max empty n_unique
Authors 0 1001 1001 7 291 0 883
Citation 0 1001 1001 39 104 0 1000
DOI 23 978 1001 16 29 0 977
ISSNs 0 1001 1001 9 19 0 51
Language 15 986 1001 2 7 0 4
Licence 6 995 1001 5 11 0 3
Publisher 0 1001 1001 7 47 0 6
Subjects 0 1001 1001 17 337 0 988
Title 0 1001 1001 18 318 0 1000
URL 0 1001 1001 57 57 0 1000
-- Variable type:Date ----------------------------------------------------------
variable missing complete n min max median n_unique
Date 0 1001 1001 2015-01-01 2015-01-12 2015-01-07 12
Counting the values of a variable is known as “faceting” in OpenRefine speak.
Two ways to generate a quick table of the languages represented in the data frame are count() and forcats::fct_count(). Since these data are primarily character data, it’s helpful to learn about factors and the forcats package. The two tables contain the same counts. It looks like the articles are published in English (coded two different ways), French and Spanish.
crossref_data %>%
count(Language)
fct_count(crossref_data$Language, sort = TRUE)
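To make the comparison concrete, here is a minimal sketch of both approaches on a small made-up vector (the toy tibble below is an assumption for illustration, not the workshop data). count() works on a data frame column, while fct_count() takes the vector itself and returns columns named f and n.

```r
library(dplyr)
library(forcats)

# Toy data, not the workshop file
toy <- tibble(Language = c("English", "EN", "English", "FR", NA))

# count() returns one row per distinct value, including NA
by_count <- toy %>%
  count(Language, sort = TRUE)

# fct_count() accepts the vector directly and treats it as a factor
by_fct <- fct_count(toy$Language, sort = TRUE)

by_count
by_fct
```

Both tables report the same tallies; the main practical difference is that count() slots into a pipeline of data frame operations.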
This time, subset on the governing license. All but six articles are covered by a Creative Commons license.
crossref_data %>%
count(Licence)
Subset on the publisher. Sort in descending order.
crossref_data %>%
count(Publisher, sort = TRUE)
Subset by authors, and sort by the most prolific. This field appears to be a multi-valued field that is pipe | separated. How do we count and visualize how many articles have multiple authors?
crossref_data %>%
count(Authors, sort = TRUE)
The above table is not very useful (unless you are tracking publishing teams that are always expressed identically). Let’s explore some methods to count the pipe characters separating the authors within a single Authors field. The stringr::str_count() function is a great way to calculate the number of delimiters in each field.
Note that counting a pipe character | requires using a regular expression, or regex. Anyone manipulating strings with computers will be far more capable after spending some time learning about regular expressions. In this case we’re looking for a pipe character |. The trick is to know that, in regex, a pipe character has a special meaning (it means “or”). Therefore we have to escape it, to make it known that we want the literal pipe character and not the special-meaning pipe character. To escape a character in regex one uses a backslash \. But the weird part is that, in R, one has to escape the escape character: \\| means look for a literal |.
Below we count the number of pipe characters in each row of the Authors field. Using the head() function, we display only the first six values (rows) of the Authors column.
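As a minimal sketch of the escaping rule, the example below counts pipes in two strings. The first author string is taken from the data preview above and the second is made up. stringr::fixed() is shown as an alternative that treats the pattern as plain text, so no escaping is needed.

```r
library(stringr)

authors <- c("Flavia Pennini|Angelo Plastino", "A single author")

# Escaped pipe: the R string "\\|" is the regex \| which matches a literal "|"
escaped_count <- str_count(authors, "\\|")

# fixed() skips regex interpretation entirely, so "|" is just a character
fixed_count <- str_count(authors, fixed("|"))

escaped_count
fixed_count
```

Either form works here; the escaped regex is the one used in the rest of this walkthrough.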
str_count(crossref_data$Authors, "\\|") %>% head()
[1] 1 1 2 3 2 3
Use dplyr::mutate to generate a new field that calculates how many authors each observation contains.
crossref_data %>%
select(Authors) %>%
mutate(multi_authorship = str_count(Authors, "\\|") + 1) %>%
select(Authors, multi_authorship)
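Counting delimiters tells us how many authors each article has, but not which individual authors appear most often. One possible approach, sketched here on made-up names rather than the workshop data, is tidyr::separate_rows(), which splits a delimited field into one row per value so that count() can tally individuals.

```r
library(dplyr)
library(tidyr)

# Toy data with hypothetical author names
toy <- tibble(Authors = c("A. Smith|B. Jones",
                          "B. Jones",
                          "C. Lee|A. Smith|B. Jones"))

author_counts <- toy %>%
  separate_rows(Authors, sep = "\\|") %>%  # one row per individual author
  count(Authors, sort = TRUE)              # most frequent author first

author_counts
```

This is only a sketch: real author strings may spell the same person several ways, which would need cleaning before the counts are meaningful.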
Visualize how many subject headings each article has, just as with authors, first as a bar graph and then as a histogram.
crossref_data %>%
mutate(SH_count = str_count(Subjects, "\\|") + 1) %>%
mutate(SH_count = as.character(SH_count)) %>%
ggplot() +
aes(fct_infreq(SH_count)) +
geom_bar()
crossref_data %>%
mutate(SH_count = str_count(Subjects, "\\|") + 1) %>%
ggplot() +
aes(SH_count) +
geom_histogram(binwidth = 1)
Using dplyr, mutate the Language variable so that “EN” and “English” are the same. Transform “ES” to “Spanish” and “FR” to “French”.
dplyr::case_when() is one specialized way to perform an if_else transformation.
crossref_data %>%
count(Language)
Since EN and English are synonymous, let’s combine them into a single value. case_when is a great function for collapsing values.
crossref_data <- crossref_data %>%
  mutate(Language = case_when(
    Language == "EN" ~ "English",
    Language == "ES" ~ "Spanish",
    Language == "FR" ~ "French",
    # keep values that match none of the above (e.g. "English"); NA stays NA
    TRUE ~ Language
  ))
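One detail of case_when() is easy to miss: any value that matches none of the conditions becomes NA. A minimal sketch on a made-up vector (not the workshop data) shows the difference a final TRUE ~ fallback makes.

```r
library(dplyr)

lang <- c("EN", "English", "FR", NA)

# Without a fallback, "English" matches no condition and turns into NA
no_fallback <- case_when(
  lang == "EN" ~ "English",
  lang == "FR" ~ "French"
)

# TRUE ~ lang passes through anything not matched by an earlier condition
with_fallback <- case_when(
  lang == "EN" ~ "English",
  lang == "FR" ~ "French",
  TRUE ~ lang
)

no_fallback
with_fallback
```

Running count(Language) before and after a recode is a quick way to confirm that no category silently disappeared.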
A stacked bar graph shows frequency by language. Each segment of a bar distinguishes a publisher. English is huge and somewhat overpowers the rest of the graph. Make a second graph (below) to drill down on the lesser-represented languages.
published_languages_bargraph <- crossref_data %>%
ggplot() +
aes(fct_infreq(Language), fill = Publisher) +
geom_bar()
published_languages_bargraph
Filter the data to show only the NA, “French”, and “Spanish” values.
crossref_data %>%
filter(is.na(Language) | Language == "French" | Language == "Spanish") %>%
ggplot() +
aes(fct_infreq(Language), fill = Publisher) +
geom_bar() +
labs(title = "Published Languages",
subtitle = "NA or Non-English",
caption = "Data Source: Crossref.org")
published_over_time <- crossref_data %>%
count(Date) %>%
ggplot(aes(Date, n)) +
geom_point() +
geom_line() +
labs(title = "Publishing Frequency by Day",
subtitle = "January, 2015")
published_over_time
Using Plotly’s ggplotly() function, generate visualizations that support interactive mousing (i.e. subsetting and exploring). Gadgets such as sliders, drop-down menus, selection boxes and radio buttons are available, and are especially useful when combining library(crosstalk) with library(flexdashboard), as seen in the opening tab of this demonstration dashboard.
ggplotly(published_languages_bargraph)
ggplotly(published_over_time)